This submission is my work alone and complies with the 30535 integrity policy. Hye-Min Jung
Collaborators: Boseong Yun
Late coins used this pset: 1. Late coins left: 8.
library(R.cache)
library(styler)
library(tidyverse)
library(dplyr)
library(ggplot2)
library(nycflights13)
flights <- nycflights13::flights
I adhered to the style guide using styler, via "Style active file" and "Style selection" from the Addins menu.
1-a. Order flights by departing airport, arrival airport, day, and scheduled departure time. For each flight, use lag() and group_by() to compute the delay on the previous flight – if there is such a flight on the same day.
* The data frame named lagged_flights is the output for 1-a.
* First, I ordered flights by origin, dest, year, month, day, and sched_dep_time. These columns were selected because, intuitively, a delay that causes problems for later flights will occur on the same year, month, and day and at the same origin.
+ (I also included dest in the ordering because the question asks for it. Personally, I would prefer excluding dest, since the destination is less directly related to the delay than the other factors.)
* Second, I lagged the delay by the previously scheduled flight within each day.
not_cancelled <- filter(flights, !is.na(dep_delay), !is.na(arr_delay))
(reordered_flights <-
not_cancelled %>%
select(origin, dest, sched_dep_time, year, month, everything()) %>%
group_by(origin, dest, year, month, day, sched_dep_time) %>%
arrange(origin, dest, year, month, day, sched_dep_time))
## # A tibble: 327,346 x 19
## # Groups: origin, dest, year, month, day, sched_dep_time [322,175]
## origin dest sched_dep_time year month day dep_time dep_delay
## <chr> <chr> <int> <int> <int> <int> <int> <dbl>
## 1 EWR ALB 1317 2013 1 1 1315 -2
## 2 EWR ALB 1621 2013 1 1 1655 34
## 3 EWR ALB 2004 2013 1 1 2056 52
## 4 EWR ALB 1327 2013 1 2 1332 5
## 5 EWR ALB 1621 2013 1 2 1746 85
## 6 EWR ALB 2004 2013 1 2 2148 104
## 7 EWR ALB 1619 2013 1 3 1716 57
## 8 EWR ALB 2038 2013 1 3 2031 -7
## 9 EWR ALB 1619 2013 1 4 1618 -1
## 10 EWR ALB 2000 2013 1 4 2031 31
## # … with 327,336 more rows, and 11 more variables: arr_time <int>,
## # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
(lagged_flights <-
reordered_flights %>%
group_by(day) %>%
mutate(previous_dep_delay = lag(dep_delay)) %>%
select(previous_dep_delay, dep_delay, sched_dep_time, origin, dest, year, month, everything()))
## # A tibble: 327,346 x 20
## # Groups: day [31]
## previous_dep_de… dep_delay sched_dep_time origin dest year month day
## <dbl> <dbl> <int> <chr> <chr> <int> <int> <int>
## 1 NA -2 1317 EWR ALB 2013 1 1
## 2 -2 34 1621 EWR ALB 2013 1 1
## 3 34 52 2004 EWR ALB 2013 1 1
## 4 NA 5 1327 EWR ALB 2013 1 2
## 5 5 85 1621 EWR ALB 2013 1 2
## 6 85 104 2004 EWR ALB 2013 1 2
## 7 NA 57 1619 EWR ALB 2013 1 3
## 8 57 -7 2038 EWR ALB 2013 1 3
## 9 NA -1 1619 EWR ALB 2013 1 4
## 10 -1 31 2000 EWR ALB 2013 1 4
## # … with 327,336 more rows, and 12 more variables: dep_time <int>,
## # arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
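The behaviour of lag() inside group_by() can be checked on a toy tibble (the values below are invented for illustration):

```r
library(dplyr)

# Two "days" of flights; lag() restarts within each group,
# so the first flight of each day has no previous delay (NA)
toy <- tibble(
  day       = c(1, 1, 1, 2, 2),
  dep_delay = c(-2, 34, 52, 5, 85)
)

toy %>%
  group_by(day) %>%
  mutate(previous_dep_delay = lag(dep_delay))
```

The NA rows in lagged_flights arise the same way: the first scheduled flight of each group has no predecessor.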
1-b. Make a plot which shows the relationship between a flight’s delay and the delay of the immediately preceding scheduled flight. You have a lot of data, so think carefully about how to develop a plot which is not too cluttered.
# install.packages("hexbin")
library(hexbin)
ggplot(data = lagged_flights) +
geom_hex(mapping = aes(x = previous_dep_delay, y = dep_delay)) +
labs(
title = "Relationship between a flight's delay and the preceding flight's delay (min)",
x = "Previous flight's delay",
y = "Flight's delay"
)
## Warning: Removed 31 rows containing non-finite values (stat_binhex).
ggplot(data = lagged_flights) +
geom_bin2d(mapping = aes(x = previous_dep_delay, y = dep_delay)) +
labs(
title = "Relationship between a flight's delay and the preceding flight's delay (min)",
x = "Previous flight's delay",
y = "Flight's delay"
)
## Warning: Removed 31 rows containing non-finite values (stat_bin2d).
ggplot(data = lagged_flights) +
geom_point(mapping = aes(x = previous_dep_delay, y = dep_delay), alpha = 1 / 100) +
labs(
title = "Relationship between a flight's delay and the preceding flight's delay (min)",
x = "Previous flight's delay",
y = "Flight's delay"
)
## Warning: Removed 31 rows containing missing values (geom_point).
2. Now we will look at delays that occur in the air. We will need a sense of how long a flight is.
2-a. Compute the air time for each flight relative to the median flight to that destination. Which flights were most delayed in the air?
+ US468 was the most delayed flight in the air.
+ Computing the median air time: I computed the median air time for each destination using group_by() and created the column median_air_time with mutate().
+ Computing the relative air time: I subtracted the median air time from air_time and created the column relative_air_time with mutate().
+ Finding the flights most delayed in the air: Using carrier + flight as the identifier of each flight, I grouped by carrier and flight and calculated the mean relative air time. Sorting the average delay in descending order, I found that US468 was the most delayed flight in the air.
(median_included_flights <-
not_cancelled %>%
select(dest, air_time, tailnum, everything()) %>%
group_by(dest) %>%
mutate(median_air_time = median(air_time)) %>%
select(dest, air_time, median_air_time, tailnum, everything()) %>%
arrange(dest))
## # A tibble: 327,346 x 20
## # Groups: dest [104]
## dest air_time median_air_time tailnum year month day dep_time
## <chr> <dbl> <dbl> <chr> <int> <int> <int> <int>
## 1 ABQ 230 246 N554JB 2013 10 1 1955
## 2 ABQ 238 246 N607JB 2013 10 2 2010
## 3 ABQ 251 246 N591JB 2013 10 3 1955
## 4 ABQ 257 246 N662JB 2013 10 4 2017
## 5 ABQ 242 246 N580JB 2013 10 5 1959
## 6 ABQ 240 246 N507JB 2013 10 6 1959
## 7 ABQ 246 246 N565JB 2013 10 7 2002
## 8 ABQ 233 246 N775JB 2013 10 8 1957
## 9 ABQ 236 246 N554JB 2013 10 9 1957
## 10 ABQ 245 246 N594JB 2013 10 10 2011
## # … with 327,336 more rows, and 12 more variables: sched_dep_time <int>,
## # dep_delay <dbl>, arr_time <int>, sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, origin <chr>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
median_included_flights %>%
mutate(relative_air_time = air_time - median_air_time) %>%
select(dest, air_time, median_air_time, relative_air_time, everything()) %>%
group_by(carrier, flight) %>%
summarise(average_delay = mean(relative_air_time)) %>%
arrange(desc(average_delay))
## # A tibble: 5,706 x 3
## # Groups: carrier [16]
## carrier flight average_delay
## <chr> <int> <dbl>
## 1 US 468 62
## 2 DL 513 44
## 3 WN 2335 44
## 4 WN 3018 44
## 5 WN 819 41
## 6 WN 2529 40
## 7 WN 4859 40
## 8 WN 688 39
## 9 WN 1581 38
## 10 WN 3294 37
## # … with 5,696 more rows
For each plane, I count the number of flights before its first delay of greater than one hour.
flights %>%
group_by(tailnum) %>%
arrange(time_hour) %>%
mutate(
long_delay = arr_delay > 60,
n_long_delays = cumsum(long_delay)
) %>%
filter(n_long_delays < 1) %>%
tally(sort = TRUE)
## # A tibble: 3,744 x 2
## tailnum n
## <chr> <int>
## 1 N705TW 97
## 2 N765US 97
## 3 N12125 94
## 4 N320AA 94
## 5 N13110 91
## 6 N3763D 82
## 7 N58101 82
## 8 N17122 81
## 9 N961UW 80
## 10 N950UW 79
## # … with 3,734 more rows
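The cumsum() trick used above can be seen on a plain vector (delays invented for illustration): flag delays over an hour, take the running count, and keep only the positions where the count is still zero.

```r
# Invented arrival delays, in minutes
delays <- c(10, 25, 70, 5, 90)

over_an_hour <- delays > 60
cumsum(over_an_hour)              # running count of long delays: 0 0 1 1 2
delays[cumsum(over_an_hour) < 1]  # flights before the first long delay: 10 25
```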
1. Explore the distribution of price.
ggplot(data = diamonds) +
geom_bar(aes(x = price))
1-a. Describe the overall pattern. Does it fit your expectations?
+ The distribution of diamond prices has an extremely long right tail. This fits my expectation, in the sense that diamond prices are diverse and span all the way up to extremely expensive values.
1-b. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)
+ An unusually sudden drop was detected when I specified binwidth = 100, so I subset the data with price < 2500. Looking closer at this subset, I could identify an empty bin in the price range around 1500.
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = price), binwidth = 100) +
labs(title = "Distribution of diamond prices")
diamonds_cheap <-
diamonds %>%
filter(price < 2500)
ggplot(data = diamonds_cheap) +
geom_bar(aes(x = price)) +
labs(title = "Distribution of diamond prices less than 2500")
2. Explore the distribution of carat.
ggplot(data = diamonds) +
geom_bar(aes(x = carat)) +
labs(title = "Distribution of carat")
diamonds_small <-
diamonds %>%
filter(carat < 2.1)
ggplot(data = diamonds_small, aes(x = carat)) +
geom_histogram(binwidth = 0.01) +
labs(title = "Distribution of carat less than 2.1")
ggplot(data = diamonds_small) +
geom_point(aes(x = carat, y = price)) +
labs(title = "Carat and price relationship")
ggplot(data = diamonds_small, aes(x = carat, y = price)) +
geom_boxplot(aes(group = cut_width(carat, 0.1))) +
labs(title = "Carat and price relationship in quantiles")
2-a. Describe the overall pattern. Does it fit your expectation given what you saw in prices?
There is a decreasing trend as carat increases, with a cyclical repetition of spikes roughly every 0.5 carat.
This is beyond my expectation, because the price distribution did not exhibit a repetitive pattern the way carat does.
2-b. How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?
23 and 1558 for 0.99 and 1 carat diamonds, respectively.
This difference likely comes from the market's preference for 1 carat diamonds over 0.99 carat ones. Producers do not cut 0.99 carat diamonds as often as 1 carat ones because of diamond sizing standards; customers, likewise, would not look for a 0.99 carat diamond when 1 carat diamonds in the same price range exist and are offered with more variety in the market.
resource : https://www.quora.com/Should-I-buy-0-99-carat-diamond-or-1-carat-diamond
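The two counts quoted above can be checked directly with filter() and count():

```r
library(dplyr)
library(ggplot2) # for the diamonds data

# Count diamonds at exactly 0.99 and exactly 1 carat
diamonds %>%
  filter(carat %in% c(0.99, 1)) %>%
  count(carat)
#> carat 0.99 -> n = 23; carat 1 -> n = 1558
```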
diamonds %>%
count(carat)
## # A tibble: 273 x 2
## carat n
## <dbl> <int>
## 1 0.2 12
## 2 0.21 9
## 3 0.22 5
## 4 0.23 293
## 5 0.24 254
## 6 0.25 212
## 7 0.26 253
## 8 0.27 233
## 9 0.28 198
## 10 0.290 130
## # … with 263 more rows
diamonds %>%
group_by(carat) %>%
summarise(ave_price = mean(price))
## # A tibble: 273 x 2
## carat ave_price
## <dbl> <dbl>
## 1 0.2 365.
## 2 0.21 380.
## 3 0.22 391.
## 4 0.23 486.
## 5 0.24 505.
## 6 0.25 551.
## 7 0.26 551.
## 8 0.27 575.
## 9 0.28 580.
## 10 0.290 601.
## # … with 263 more rows
If I leave binwidth unset, both coord_cartesian() and xlim() leave stat_bin() at its default of bins = 30; I changed this with the binwidth argument.
coord_cartesian(): setting limits on the coordinate system zooms the plot (like looking at it through a magnifying glass) and does not change the underlying data the way setting limits on a scale does. It does not eliminate any data points; it simply zooms to the desired region.
xlim() / ylim(): these remove rows containing non-finite values (stat_bin) and rows containing missing values (geom_bar). In other words, xlim and ylim remove data points, whereas coord_cartesian only zooms. Removing points causes trouble whenever ggplot2 uses one of the stat_ functions to compute something (such as a smoothed fit, a density, or a contour) from the underlying data before plotting. Note also that xlim and ylim passed inside coord_cartesian() do not exclude any data.
If I try to zoom in on half a bar, coord_cartesian() shows the bar cut off at the edge of the panel, whereas xlim() and ylim() drop the affected rows and leave an empty bin.
resource: http://rstudio-pubs-static.s3.amazonaws.com/209392_437ec4da7fa2432d831320f3591e7491.html
ggplot(data = diamonds, aes(x = price)) +
geom_histogram() +
coord_cartesian(xlim = c(0, 10000), ylim = c(0, 10000)) +
labs(title = "Histogram of diamond price (coord_cartesian, $0-$10,000)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = diamonds, aes(x = price)) +
geom_histogram() +
xlim(c(0, 10000)) +
ylim(c(0, 10000)) +
labs(title = "Histogram of diamond price (xlim, ylim, $0-$10,000)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 5222 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
ggplot(data = diamonds, aes(x = price)) +
geom_histogram(binwidth = 30) +
coord_cartesian(xlim = c(0, 10000), ylim = c(0, 10000)) +
labs(title = "Histogram of diamond price(coord_cartesian, xlim, ylim, binwidth)")
# zoom half a bar
ggplot(data = diamonds, aes(x = price)) +
geom_histogram() +
coord_cartesian(xlim = c(0, 10000), ylim = c(0, 5000)) +
labs(title = "Zoomed histogram of diamond price (coord_cartesian, $0-$10,000)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = diamonds, aes(x = price)) +
geom_histogram() +
xlim(c(0, 10000)) +
ylim(c(0, 5000)) +
labs(title = "Zoomed histogram of diamond price (xlim, ylim, $0-$10,000)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 5222 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_bar).
ggplot(data = diamonds, aes(x = price)) +
geom_histogram(binwidth = 30) +
coord_cartesian(xlim = c(0, 10000), ylim = c(0, 5000)) +
labs(title = "Zoomed histogram of diamond price(coord_cartesian, xlim, ylim, binwidth)")
I examined which variables are related to price using a linear model and corrplot.
lm(price ~ ., data = diamonds)
##
## Call:
## lm(formula = price ~ ., data = diamonds)
##
## Coefficients:
## (Intercept) carat cut.L cut.Q cut.C
## 5753.762 11256.978 584.457 -301.908 148.035
## cut^4 color.L color.Q color.C color^4
## -20.794 -1952.160 -672.054 -165.283 38.195
## color^5 color^6 clarity.L clarity.Q clarity.C
## -95.793 -48.466 4097.431 -1925.004 982.205
## clarity^4 clarity^5 clarity^6 clarity^7 depth
## -364.918 233.563 6.883 90.640 -63.806
## table x y z
## -26.474 -1008.261 9.609 -50.119
(numeric_diamonds <- diamonds %>%
select(carat, depth, table, x, y, z, price))
## # A tibble: 53,940 x 7
## carat depth table x y z price
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 0.23 61.5 55 3.95 3.98 2.43 326
## 2 0.21 59.8 61 3.89 3.84 2.31 326
## 3 0.23 56.9 65 4.05 4.07 2.31 327
## 4 0.290 62.4 58 4.2 4.23 2.63 334
## 5 0.31 63.3 58 4.34 4.35 2.75 335
## 6 0.24 62.8 57 3.94 3.96 2.48 336
## 7 0.24 62.3 57 3.95 3.98 2.47 336
## 8 0.26 61.9 55 4.07 4.11 2.53 337
## 9 0.22 65.1 61 3.87 3.78 2.49 337
## 10 0.23 59.4 61 4 4.05 2.39 338
## # … with 53,930 more rows
library(corrplot)
## corrplot 0.84 loaded
corrplot(cor(numeric_diamonds), method = "number")
ggplot(data = diamonds) +
geom_point(aes(x = carat, y = price)) +
labs(title = "Carat size & diamond price relationship")
ggplot(data = diamonds) +
geom_point(aes(x = depth, y = price)) +
labs(title = "Depth & diamond price relationship")
ggplot(data = diamonds) +
geom_point(aes(x = table, y = price)) +
labs(title = "Table & diamond price relationship")
ggplot(data = diamonds) +
geom_point(aes(x = x, y = price)) +
labs(title = "x(length in mm) & diamond price relationship")
ggplot(data = diamonds) +
geom_point(aes(x = y, y = price)) +
labs(title = "y(width in mm) & diamond price relationship")
ggplot(data = diamonds) +
geom_point(aes(x = z, y = price)) +
labs(title = "z(depth in mm) & diamond price relationship")
ggplot(data = diamonds) +
geom_boxplot(aes(x = cut, y = price)) +
labs(title = "Cut & diamond price relationship")
ggplot(data = diamonds) +
geom_boxplot(aes(x = color, y = price)) +
labs(title = "Color & diamond price relationship")
ggplot(data = diamonds) +
geom_boxplot(aes(x = clarity, y = price)) +
labs(title = "Clarity & diamond price relationship")
ggplot(data = diamonds, aes(carat, fill = cut)) +
geom_histogram(binwidth = 0.05) +
labs(title = "Frequency of diamond cut by carat")
small_carat <-
diamonds %>%
filter(carat < 3)
ggplot(data = small_carat, aes(carat, fill = cut)) +
geom_histogram(binwidth = 0.05) +
labs(title = "Frequency of diamond cut by carat smaller than 3")
ggplot(small_carat, aes(carat, colour = cut)) +
geom_freqpoly(binwidth = 0.05) +
labs(title = "Distribution of diamond cut by carat smaller than 3")
ggplot(data = diamonds) +
geom_boxplot(aes(x = reorder(cut, carat, FUN = median), y = carat)) +
labs(title = "Carat and cut relationship (reordered by median)", x = "cut", y = "carat")
Ideal is the most common cut in every color category, as you can see in the plot below.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = color, fill = cut), position = "fill") +
labs(title = "Proportion of cuts within color categories", x = "Color", y = "Proportions of cuts")
2-b. Repeat the exercise again to show distribution of colour within cut.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = color), position = "dodge") +
labs(title = "Distribution of color within cut categories", x = "Cut", y = "Count")
2-c. Using the dataframe you just produced as input, reproduce the following graph. + resource : https://stackoverflow.com/questions/20226487/how-to-annotate-geom-bar-above-bars
ggplot(data = diamonds) +
geom_bar(
mapping = aes(x = color, y = stat(prop), group = 1),
show.legend = FALSE
) +
theme(aspect.ratio = 1) +
facet_wrap(~cut) +
labs(x = "color", y = "prop")
3-a. Make a frequency polygon using cut_width() and another using cut_number(). Adjust the parameters until you think the graphs are most useful. + resource : https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf
ggplot(data = diamonds, mapping = aes(x = price, colour = cut_width(carat, 1))) +
geom_freqpoly() +
labs(title = "Distribution of diamond price by carat")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = diamonds, mapping = aes(x = price, colour = cut_number(carat, 10))) +
geom_freqpoly() +
labs(title = "Distribution of diamond price by carat")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
3-b. How does this affect your visualisation of the joint distribution of carat and price and your ability to interpret them?
+ cut_width() sets the bin width of carat, whereas cut_number() partitions carat into the specified number of bins.
+ The two give different visualizations, unless you equalize their parameters so that both produce exactly the same set of ranges and number of bins.
+ Therefore, the two plots are interpreted differently: I would read the cut_width(carat, 1) freqpoly plot as 'the distributional change in diamond prices with each unit increase in carat size', whereas I would read the cut_number(carat, 10) freqpoly plot as 'ten levels of carat size and the distribution of diamond prices within each'.
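The difference between the two binning helpers can be seen on a small vector (values invented for illustration):

```r
library(ggplot2)

x <- c(0.2, 0.5, 0.9, 1.0, 1.4, 2.5)

# cut_width(): fixed-width bins (here 1 carat wide); bin counts vary
table(cut_width(x, 1))

# cut_number(): the number of bins is fixed (here 3), each holding
# roughly the same number of observations
table(cut_number(x, 3))
```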
ggplot(data = diamonds, mapping = aes(x = carat, colour = cut_number(price, 10))) +
geom_freqpoly() +
labs(title = "Distribution of carat, partitioned by price.")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = diamonds) +
geom_boxplot(mapping = aes(x = carat, y = cut_number(price, 10))) +
labs(title = "Conditional distribution of carat by price (binned into 10 price range)")
ggplot(data = diamonds) +
geom_bar(
mapping = aes(x = carat),
show.legend = FALSE
) +
theme(aspect.ratio = 1) +
facet_wrap(~cut) +
labs(title = "Distribution of carat by 5 cut categories", x = "Carat", y = "Count")
geom_point() is better for identifying outliers. With geom_point(), I can easily distinguish suspicious observations that sit away from the other, clustered observations, because every observation is drawn as its own point. geom_bin2d() is not as helpful as geom_point() when looking for outliers: it shows how populated each rectangular area is, but it is difficult to pinpoint individual observations plotted abnormally far from the rest, because geom_bin2d() divides the plane into rectangles, counts the cases in each rectangle, and (by default) maps the count to the rectangle's fill. Although geom_bin2d() is a common alternative to geom_point(), geom_point() is the better choice for spotting outliers.
ggplot(data = diamonds) +
geom_point(mapping = aes(x = x, y = y)) +
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11)) +
labs(title = "1st geom plot for identifying outliers (better)")
ggplot(data = diamonds) +
geom_bin2d(mapping = aes(x = x, y = y)) +
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11)) +
labs(title = "2nd geom plot for identifying outliers")